Scraping the OLX India website to get used car price data
Figure 1: Scraping the OLX website for cars
Everyone has considered, or will consider, buying a car for various reasons. As for me, browsing through OLX is a hobby and a passion that has been part of my weekend fix for quite some time now. Aimless browsing through classifieds, although joyful, does not help us understand trends and patterns. Therefore, I decided to occasionally scrape the OLX website for used car prices and make visualizations from the data. The primary objective was to have fun, and also to grab some good deals when they present themselves.
In the first step, we load up the packages “tidyverse”, “httr”, and “rvest” to make sure that all the functions we call will work seamlessly. Now, I present to you the function “olxfind”.
library(tidyverse)
library(httr)
library(rvest)

olxfind <- function(area, yearstart, yearend, make) {
  # Build the search link: first-owner cars of one make within a range of years
  link <- paste0("https://www.olx.in/", area, "/cars_c84?filter=first_owner_eq_1%2Cmake_eq_", make, "%2Cyear_between_", yearstart, "_to_", yearend)
  # It is important to create a session first or else you may get a 403 error
  page <- link |> session() |> read_html()
  # Prices and the "year - mileage" text are picked out by the CSS classes of the listing cards
  prices <- page |> html_elements("._3GOwr") |> html_text()
  yearmileage <- page |> html_elements(".KFHpP") |> html_text()
  # pic <- page |> html_attrs("img")
  polo <- tibble(prices, yearmileage)
  # Split the text columns, strip units and punctuation, then convert everything to numbers
  polo |>
    separate(yearmileage, into = c("year", "mileage"), sep = " - ") |>
    mutate(mileage = str_remove_all(mileage, pattern = "km")) |>
    mutate(mileage = str_remove_all(mileage, pattern = "\\.+0")) |>
    mutate(mileage = str_remove_all(mileage, pattern = "[:punct:]")) |>
    mutate(prices = str_remove_all(prices, pattern = "[:punct:]")) |>
    separate(prices, into = c("symbol", "prices"), sep = " ") |>
    select(year, mileage, prices) |>
    mutate(across(where(is.character), as.numeric))
}

# Store the result as polo1 instead of assigning it globally from inside the function
polo1 <- olxfind(area = "dehradun_g4059236", yearstart = "2014", yearend = "2020", make = "volkswagen")
This function takes four arguments, all strings: area, yearstart, yearend, and make.
area is one of the most important arguments for this function. You need to specify it accurately if you want area-specific results. As the definition of olxfind shows, all the arguments are used primarily to build the page link that will be scraped.
Therefore, before running the function, you should ideally visit OLX and select the area of your choice from the area button. Then copy the string that specifies the region from the URL and use it as the area argument. For example, if I use only “dehradun” for the area argument, we will get an error. Since OLX lists Dehradun as “dehradun_g4059236”, you need to pass exactly that in the area argument. Suppose you want to search for cars in the Delhi region; the OLX link then becomes “https://www.olx.in/delhi_g4058659/cars_c84”. In this case the area code for Delhi is “delhi_g4058659”, and that is what you need to pass as the area argument.
Notice that the product code is “cars_c84”, which is already part of the link, so you do not need to supply it when calling the function. If you are interested in motorcycles (“motorcycles_c81”) or mobile phones (“mobile-phones_c1453”), you will have to replace “cars_c84” in the link inside the function definition.
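One way to make the area and product codes explicit is a small helper that assembles the link. This is only a sketch: olx_link and its product and filter arguments are names made up for illustration and are not part of olxfind. I am also assuming that other product pages follow the same link pattern as the cars page.

```r
# Hypothetical helper (not part of olxfind): builds an OLX listing link from its parts
olx_link <- function(area, product = "cars_c84", filter = NULL) {
  link <- paste0("https://www.olx.in/", area, "/", product)
  # Append the filter string only when one is supplied
  if (!is.null(filter)) {
    link <- paste0(link, "?filter=", filter)
  }
  link
}

olx_link("delhi_g4058659")
# [1] "https://www.olx.in/delhi_g4058659/cars_c84"

olx_link("dehradun_g4059236", product = "motorcycles_c81")
# [1] "https://www.olx.in/dehradun_g4059236/motorcycles_c81"
```

Keeping the product code as an argument with a default means the car search keeps working unchanged, while other categories need only one extra argument.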
You can also filter the cars based on the year of manufacture. This will certainly help you narrow down to the relevant results and filter out unnecessary information. Although year is a numeric variable, for the purposes of this function it is treated as a string, since it is pasted into a string to form the link, so it is simplest to write “2014” rather than 2014 in the function call.
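Since a typo in the years only shows up as an empty or odd result from the site, a small check before building the link can save some confusion. The helper below is a sketch under my own assumptions: year_filter is a hypothetical name, and the four-digit rule is simply what the link format used in olxfind appears to expect.

```r
# Hypothetical helper: validates the two years and returns the year part of the filter
year_filter <- function(yearstart, yearend) {
  years <- as.character(c(yearstart, yearend))
  # Both values must look like four-digit years
  if (!all(grepl("^[0-9]{4}$", years))) {
    stop("yearstart and yearend must be four-digit years, e.g. \"2014\"")
  }
  # The range must run from the older year to the newer one
  if (as.numeric(years[1]) > as.numeric(years[2])) {
    stop("yearstart must not be later than yearend")
  }
  paste0("year_between_", years[1], "_to_", years[2])
}

year_filter("2014", "2020")
# [1] "year_between_2014_to_2020"
```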
The OLX website gives you the option to select cars from various manufacturers. With olxfind you can get data for only one manufacturer at a time. You can save the data from each call, add the name of the manufacturer as a separate column, and then use “dplyr::bind_rows” to join the results together. This will ensure that you get the maximum number of listings from each manufacturer.
Right now, this function cannot parse more than 40 entries because of the design of the OLX website. If someone has any idea how to get all the data points and bypass the “load more” button, please share your insights in the comments section. I am also looking into the possibility of downloading the images associated with each data point to the database. In its present form, the function requires users to tweak a number of things if they want to look for other product types. Later, I might add some conditional statements that link with a “product” argument and create the relevant page links for users.
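The per-manufacturer workflow could be sketched roughly as follows. collect_makes and fake_fetch are hypothetical names: fake_fetch is an offline stand-in with placeholder values so that the example runs without hitting the site, and “honda” is only my guess at how OLX spells that make in the link.

```r
library(dplyr)
library(purrr)

# Hypothetical wrapper: calls `fetch` once per make and stacks the results,
# recording the make of each row in a new "make" column
collect_makes <- function(makes, fetch, ...) {
  makes |>
    set_names() |>
    map(\(m) fetch(make = m, ...)) |>
    bind_rows(.id = "make")
}

# Offline stand-in for olxfind; the numbers are placeholders, not scraped data
fake_fetch <- function(make, ...) {
  tibble(year = c(2014, 2015), mileage = c(60000, 45000), prices = c(400000, 500000))
}

collect_makes(c("volkswagen", "honda"), fetch = fake_fetch)

# With the real function (needs network access):
# collect_makes(c("volkswagen", "honda"), fetch = olxfind,
#               area = "dehradun_g4059236", yearstart = "2014", yearend = "2020")
```

Naming the list with set_names() is what lets bind_rows() fill the make column through its .id argument.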
Now that we have the data at hand, we can probably conduct some exploratory visualizations on the same.
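A plot of price against year of manufacture could be drawn along these lines. This is a sketch and not necessarily the exact chart discussed next: plot_prices_by_year is a hypothetical helper, the box plot with jittered points is my own choice, and it assumes the scraped tibble has numeric year and prices columns, as the output of olxfind does.

```r
library(ggplot2)

# Hypothetical helper: `cars` is a data frame with numeric `year` and `prices` columns
plot_prices_by_year <- function(cars) {
  ggplot(cars, aes(x = factor(year), y = prices)) +
    geom_boxplot(outlier.shape = NA) +          # spread of prices within each year
    geom_jitter(width = 0.15, alpha = 0.5) +    # individual listings on top
    labs(x = "Year of manufacture", y = "Listed price")
}

# With the scraped data stored as polo1:
# plot_prices_by_year(polo1)
```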
This graph reveals that there is some overlap between used car prices across years of manufacture. This can be attributed to factors like the model variant, the number of previous owners and the colour of the car. However, that is a topic for another day; in this case we will make do with only the variables we have at our disposal. Let’s draw another graph by summarizing the mean prices and mileage of the cars grouped by year. The table below lists the mean price for each year of manufacture.
| year | mean price (INR) |
|---|---|
| 2014 | 418071.4 |
| 2015 | 524166.7 |
| 2016 | 515000.0 |
| 2017 | 489142.9 |
| 2018 | 590000.0 |
| 2019 | 490000.0 |
| 2020 | 855000.0 |
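A summary like the one in the table can be computed with a grouped mean. The sketch below assumes the scraped tibble has numeric year and prices columns; mean_price_by_year is a hypothetical helper name, and dropping missing prices with na.rm = TRUE is my own choice.

```r
library(dplyr)

# Hypothetical helper: mean listed price for each year of manufacture
mean_price_by_year <- function(cars) {
  cars |>
    group_by(year) |>
    summarise(prices = mean(prices, na.rm = TRUE))  # ignore listings without a parsed price
}

# With the scraped data stored as polo1:
# mean_price_by_year(polo1)
```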
In this next plot we create a summary of the mean prices of used VW cars manufactured between 2014 and 2019. We find that there is a sharp decrease in the prices of used VW cars after the first five years. There is a negligible difference between the mean prices of cars that are six or seven years old, but there is an appreciable drop in prices when the age reaches eight years. The biggest drop in prices occurs when the cars age from five to six years.
If you are looking for a used VW car, it might be better to go for a six-year-old or an eight-year-old car. I hope you found this article insightful. If you have any suggestions, please feel free to share them in the comments section.